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Abstract 

The Field Estimator for Arbitrary Spaces (FiEstAS) computes the con- 
tinuous probabihty density field underlying a given discrete data sample in 
multiple, non-commensurate dimensions. The algorithm works by construct- 
ing a metric-independent tessellation of the data space based on a recursive 
binary splitting. Individual, data-driven bandwidths are assigned to each 
point, scaled so that a constant "mass" Mq is enclosed. Kernel density esti- 
mation may then be performed for different kernel shapes, and a combination 
of balloon and sample point estimators is proposed as a compromise between 
resolution and variance. A bias correction is evaluated for the particular 
(yet common) case where the density is computed exactly at the locations of 
the data points rather than at an uncorrelated set of locations. By default, 
the algorithm combines a top-hat kernel with Mq = 2.0 with the balloon 
estimator and applies the corresponding bias correction. These settings are 
shown to yield reasonable results for a simple test two- dimensional 

ring, that illustrates the performance for oblique distributions, as well as for 
a six-dimensional Hernquist sphere, a fairly realistic model of the dynamical 
structure of stellar bulges in galaxies and dark matter haloes in cosmological 
N-body simulations. Results for different parameter settings are discussed in 
order to provide a guideline to select an optimal configuration in other cases. 
Source code is available upon request. 
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1. Introduction 

Given a point process where the D- dimensional probabihty density field 
/(x) is sampled by random points X,, the goal of density estimation is 
to infer the continuous function /(x) from the discrete set of Xj. One of 
the most popular approaches to the problem is kernel density estimation, in 
which the field is estimated by 



where the kernel K{u) is an even function that integrates to unity, and the 
bandwidth H is a DxD matrix that specifies the scale, shape, and orientation 
of the kernel. The choice of this matrix has been thoroughly discussed in 
different contexts, and extensive reviews exist in the literature [e.g. HI, [ij. 

The importance of density estimation cannot be overstressed. Quite of- 
ten, one is directly interested in the density itself; the FiEstAS algorithm 
was originally developed [sj to evaluate the density of particles in the six- 
dimensional phase space of positions and velocities. Although the problem 
has recently arisen considerable interest [e.g. 0, Is], 0, 0], it is of course only 
an anecdotical example. Nevertheless, it illustrates the difficulty of defining 
a metric (and related concepts, such as neighbourhood) in the general, non- 
Euclidean case. Although distances can be trivially defined in both three- 
dimensional subspaces, it is not clear how positions and velocities should be 
combined in order to produce a meaningful six-dimensional distance. It can 
be shown that a global scaling will only be appropriate for a certain region 
of the phase space, but not for the whole system [see the discussion iny, ITJ. 
In other words, the metric must adapt to the local structure of the data in 
order to recover the underlying density field. 

In terms of applications, density estimation can be helpful in data min- 
ing problems. Unsupervised classification may be performed by identifying 
independent clusters with local density maxima, with boundaries set by the 
saddle points. In supervised classification, one can compute the probability 
distribution for each group c in the training set, /c(x), from the N^. data 
points belonging to it. Applying Bayes' theorem, the probability that a new 
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datum X belongs to class c is given by 



p(e|x) = (2) 

where tTc denotes the prior probability of each class, and the sum in the 
denominator runs over all classes. 

This work discusses the implementation of kernel smoothing in the Field 
Estimator for Arbitrary Spaces (FiEstAS). The algorithm is fully described 
in Section [21 and the results of benchmark tests are presented in Section [31 
The main conclusions are summarized in Section [H 



2. Description of the algorithm 

FiEstAS provides, for a given dataset {Xj}j^^^ in D dimensions, the 
value of /(x) at any arbitrary point x. The algorithm involves the following 
steps: 

1. Tesselation of the D-dimensional space. 

2. Assignment of bandwidths to every data point. 

3. Estimation of /(x). 

4. Bias correction (if necessary). 

Each of them is described below, along with the different options and 
parameters that apply in each case. 

2.1. Tesselation 

The first step of the algorithm is the division of the data space in cells con- 
taining exactly one point. An important issue is the absence of a well-defined 
metric, which greatly increases the range of applicability of the method. 
Rather than using distances between data points, FiEstAS recursively di- 
vides the space by means of a k-d tree, one dimension at a time, until there 
is only one point per leaf. 

There are several criteria to select the dimension to split at each step. 
The original version of FiEstAS [3|] was fine-tuned to estimate densities in 
phase space, and it used the information that both the position and velocity 
subspaces are Euclidean. Moreover, it was imposed that divisions should 
take place alternatively in each subspace. A significant improvement over this 
scheme, proposed by |8i], is the selection of the dimension with lower Shannon 
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entropy. Such a choice results in more divisions along the dimensions that 
show more structure, and therefore it adapts better to the distribution of 
the data. A very similar scheme was implemented in jof to use FiEstAS in 
the context of Monte Carlo numerical integration: when a tree node has to 
be split, a histogram with B = 1 + a/ Anode bins is built for each dimension, 
from the minimum to the maximum value attained by the corresponding 
coordinate. The log-likelihood for the histogram counts to arise from a 
Poissonian distribution is given by 

B 

Ld = HNnodJ) - N^ode HB) - H'^bJ) (3) 

6=1 

where the indices 1 < d < D and 1 < b < B denote the dimension and the 
bin number, respectively, um is the number of points in each bin, and Anode 
is the total number of points in the node. The dimension with smaller L is 
divided at the point a;spiit = {xi + x^)/2, where x\ is the maximum x of all 
points lying on the "left" side (6 < &spiit) and is the minimum x of the 
points lying on the "right" {h > fegput) side. The bin 1 < 6spiit < -B is chosen 
in order that the number of points on each side is as close as possible to 

A^node/2. 

A crude estimate of the density can be obtained as the inverse of the 
cell volume. As shown in j^, this estimate is very noisy, and it dramatically 
underestimates the density of particles near the boundary of the system. 
This becomes a critical problem in many dimensions, because the fraction of 
points affected quickly approaches unity as D increases. A simple correction 
was applied in |3| to data points at the boundary of the hypercubical domain, 
and a scheme based on the mean interparticle separation was used in to 
adjust the shape of every tree node. In the present version of FiEstAS, such 
a correction is not necessary. 

2.2. Bandwith assignment 

In principle, one should compute the D{D + 1) /2 independent coefficients 
of the bandwith matrix H that minimize the mean integrated square error. 
However, doing that for every single datapoint can be impractical for large 
samples, and a simpler prescription has been adopted. 

First, the bandwidth matrices are constrained to be diagonal. Although 
this is far from optimal when the data are distributed obliquely with respect 



to the coordinate axes [see e.g. [ly, lUl, ll2j, there is a substantial gain in 
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Figure 1: Bandwidth assignment for a given particle (plotted in red) in two dimensions. 
The box on the left panel represents x ± ct, where a is the dispersion vector given by 
expression ([4]). The bandwidths ([5]) yield the box x ± h shown on the right panel, better 
adapted to the local distribution of data points. 

speed, memory consumption, and code simplicity, by reducing the number of 
free parameters. This prescription will work well if the field is well sampled, 
although anisotropic kernels would perform better in oblique regions where 
the sampling is sparse. 

The relation between the D smoothing lengths hd of each point is esti- 
mated from the local dispersion of the data along each axis 



where the index n refers to the A^nei neighbours defined by the FiEstAS 
tessellation. The smoothing lengths are then set to 







with weights 



w, 



'n 




(6) 



This measure is less sensitive to the presence of outhers than the simpler 
prescription hd = cr^ (see Figure [T]) . 

In addition, FiEstAS offers the possibility of imposing a particular met- 
ric to any subspace by specifying a list of dimensions {di}i=i^L and the relative 
scale between them {si}i=i^l- Defining S = Y[i=i and V = Y[i=i ^di, 

V 

hd, =si- (7) 

all other dimensions remaining unaltered. For instance, in phase space one 
could set dimensions di = {1,2,3} (positions) to scale as si = {1.0, 1.0, 1.0} 
and then impose the same Euclidean metric to the velocities, di = {4,5,6}. 
The relation between both spaces is not specified, and can vary freely from 
point to point. 

Finally, the overall scale of the bandwidths is set so that the mass con- 
tained within the hypercube they define is equal to the user-defined param- 
eter Mq. The value of Mq controls the degree of smoothing, and can be 
thought of as a constant (not necessarily integer) "number of neighbours" of 
the smoothing kernel. In order to compute it, each data point (of unit mass) 
is uniformly distributed over its cell, without any boundary correction. 




where Cj(x) = 1 if x lies inside the j-th FiEstAS cell and otherwise, and 
the bandwidths are scaled until rrii = Mq within a 10 per cent tolerance. 
This is the only case in which the mass of the data is distributed like in the 
original implementation of FiEstAS. 

2.3. Field estimation 

At this point, it would be possible to estimate the density as 

N D 

• , J , 'na "'id 
«=1 d=l 

where we have used a "product kernel" K. The current implementation 
includes top hat, K{u) = 1/2, triangular-shaped cloud, K{u) = 1 — and 
Epanechnikov, K{u) = |(1 — m^), kernels, where — 1 < m < 1. 
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Apart from this possibility, FiEstAS can also combine /i^(x) with a 
top-hat balloon estimator 



1 /-x+hKCx) 
/b(x) = / /;,(xo) d^Xo (10) 

lld=l ^hKd{x) A-hK(x) 

based on a local bandwidth 

. N D 

Mx) = ^Eh.^7^^(^^^^) (11) 

interpolated from the individual particle bandwidths hj by using the same 
kernel as in equation ([9]). 

2.4- Bias correction 

In many, if not most, practical applications of the algorithm, one is inter- 
ested in the value of the density field precisely at the locations of the sample 
points, and only fi = /(Xj) is evaluated. As discussed in [8j], a positive bias 
that depends on the chosen kernel and its bandwidth arises in this particular 
case because we are not evaluating the density at a completely independent 
set of locations. The magnitude of this bias can be easily estimated for a uni- 
form probability distribution by considering the average values of //^(Xj) and 
/B(Xj). In a uniform Poissonian distribution, /(x) = /o, all the smoothing 
lengths would be given by 

Mo ^ /o(2/i)^ (12) 

and thus 



d=l d=l ^ 

whereas, for the balloon estimator, 

(/B(x,)) ^ n ^ - - 1) n - irj^ - ^/» (") 

d=i d=i ^ 

Therefore, assuming ^ 1, the algorithm can apply a correction fi = 
juncorrected/^^^^) whcu ouly the fi are requested, where bK = [2K{0) f /Mq 
and 6b = I/Mq. It is important to bear in mind that this correction factor 
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must not be applied in the general case, where the sample and evaluation 
points do not coincide. In particular, it should not be confused with the bias 
arising from the derivatives of / (note that, in fact, the values of b have been 
derived for a constant density), that has not been accounted for due to the 
difficulties associated to the estimation of local derivatives. 

3. Results 

The accuracy of the density reconstruction has been tested in two bench- 
mark two-dimensional ring and a six- dimensional Hernquist sphere. 
We compare the performance of differnet kernels, as well as the scaling with 
the number N of sample points. Regarding the smoothing parameter, Mq = 2 
arguably represents a reasonable minimum, with smaller values yielding re- 
sults (bandwidths and densities) that are dominated by the nearest data 
point. As will be shown below, increasing this parameter reduces the statis- 
tical variance of the estimator at the expense of resolution. A value Mq = 10 
is considered for reference, but higher values may be suitable depending on 
the user requirements, especially as the number of dimensions increases. 

3.1. Two-dimensional ring 

The first distribution is a ring in two dimensions with uniform density 
between an inner and an outer radius of 0.95 and 1.05, respectively, in ar- 
bitrary units. A random realization with 100 sample points is depicted in 
Figure m together with the density field returned by the FiEstAS algorithm 
under different parameter configurations. In all cases, the shape of the ring 
is correctly recovered, although some artifacts arise when the cells of the 
FiEstAS tessellation become extremely elongated. Since these artifacts are 
associated to individual points, they become more evident for large values of 
Mq. As can be seen in the bottom panels, they are completely absent when 
a locally Euclidean metric (arguably the most appropriate for this problem, 
at least globally) is imposed. 

The reconstruction obtained by the top-hat kernel has the obvious draw- 
back of the sharp square edges, and the results obtained with the triangular- 
shaped cloud (not shown) or the Epanechnikov kernel are much more satis- 
factory in that sense. For = 100, the Epanechnikov kernel with Mq = 10 
tends to severely oversmooth the density distribution. When the metric is 
constrained to be locally Euclidean [h^ = hy at every point), the width of 
the ring is systematically overestimated, but the recovered shape is perfectly 
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a) Top-Hat kernel, Mn=-.0 



b) Epanechnikov kernek k'lg=2 



c) Epanechnikov kernek k1p=10.0 



d) Top-hot + hokoon, U^=2 




Figure 2: Density field recovered by the FiEstAS algorithm for a random realization of a 
two-dimensional ring distribution with 100 sample points (blue squares). Colours indicate 
local density, in arbitrary units, and contours enclose 5, 25, 50, 75, and 95 per cent of 
the mass. Dashed lines indicate the true distribution. The metric used on the top panels 
has not been constrained, whereas an Euclidean metric has been imposed on the bottom 
panels. Columns represent the results obtained for: a) top-hat kernel with Mq = 2. b) 
Epanechnikov kernel with Mq = 2. c) Epanechnikov kernel with Mq — 10. d) FiEstAS 
balloon estimator, equation (jlOp . combined with a top-hat kernel with Mq ~ 2. 
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a) Top-hot kernel. Mq=2.0 b) Eponechnikov kernel, Wq=2.0 c) Epanechnikov kernel, Mq=10.0 d) Top-hot + balloon, Mq=2.0 



N=100 
N^IOOO 
N = 10* 




Figure 3: Probability distribution of the variable qi = log for random realizations 

of the two-dimensional ring distribution with N = 100, 1000, 10** and 10^ sample points. 
Columns represent different estimators, and an Euclidean metric has been imposed on the 
bottom panels. 

circular. For the unrestricted metric, the density distribution is deformed 
into a shghtly square shape ahgned with the coordinate axes. This is due 
to the combined effect of the hypercubical FiEstAS tessellation (see 0] for 
a comparison of different schemes) and the diagonal bandwith matrix. As 
a result, kernel shapes in "horizontal" or "vertical" regions tend to be more 
elongated, whereas ~ hy in the "diagonal" regions, causing the "diamond" 
and "square" shapes observed for the inner and outer boundaries of the dis- 
tribution. As stated above, it is in these oblique regions, poorly sampled 
within a smoothing volume, where an anisotropic kernel would certainly pro- 
vide a significant advantage. Finally, combining a top-hat kernel with Mq = 2 
with the balloon estimator (llOp yields a density field that is bracketed by the 
results of the Epanechnikov kernel with Mq = 2 and Mq = 10. 

More quantitatively, the probability distribution of the variable = 
log 77$4 is shown in Figure E] for several values of the number of sam- 
pie points between = 100 and = 10^. The bias (qi) and the variance 
V {Qi) ~ ili)'^ of each estimator are quoted in Table [H Since the density 
could already be properly reconstructed with ~ 1000 points, the prob- 
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AT 
iV 


Top-hat 


Epanechnikov 


T7-i^t-, i\/f in 

Hvpa., Mq = iu 


Top-hat ±balloon 


100 


-0.28 ±0.33 


-0.27 ±0.31 


-0.50 ±0.18 


-0.32 ±0.26 


1000 


-0.10 ±0.38 


-0.09 ±0.35 


-0.17 ±0.24 


-0.11 ±0.29 


10^ 


-0.03 ±0.36 


-0.00 ±0.32 


-0.04 ±0.22 


-0.01 ±0.26 


10^ 


-0.00 ±0.34 


0.03 ±0.30 


-0.01 ±0.21 


0.02 ±0.24 


100 


-0.33 ±0.30 


-0.30 ±0.29 


-0.57 ±0.19 


-0.36 ±0.24 


1000 


-0.12 ±0.35 


-0.11 ±0.34 


-0.15 ±0.21 


-0.09 ±0.25 


10^ 


-0.04 ± 0.34 


-0.01 ±0.31 


-0.05 ±0.20 


-0.01 ±0.25 


10^ 


-0.02 ±0.32 


0.02 ±0.29 


-0.03 ±0.18 


0.01 ±0.23 



Table 1: Average value (q,) and dispersion V (qf) ~ ((Zi)^ of the variable qi = log for 
the two-dimensional ring distribution. Columns show the number of sample points and 
the results of each estimator. Top and bottom rows correspond to the unrestricted and 
Euclidean metrics, respectively. 



abihty distribution of qi for this two-dimensional problem does not change 
much with A^, with the exception of the oversmoothing shown by all estima- 
tors for = 100. The bias correction was of the order of 20 — 50 per cent 
(0.09 — 0.18 dex) in all cases but the Epanechnikov kernel with Mq = 2, for 
which it was about a factor of two. The variance also depends on the choice 
of a specific kernel and smoothing parameter Mq, ranging from ~ 60 percent 
in the Epanechnikov kernel with Mq = 10 to more than a factor of two for 
the top-hat kernel. It may be argued, though, that some of this dispersion is 
indeed physical, in the sense that it reflects the Poisson fluctuations inherent 
to the random realization of the ideal uniform distribution. In other words, 
there really are several clumps in the point distribution, and they are clearly 
visible in Figure [2l If one is interested in the actual physical density of these 
regions, its value should be higher than in those others that happen to con- 
tain less points. If, on the other hand, one is interested in the probability 
density field from which the sample was drawn, some statistical criterion has 
to be devised in order to test whether the fluctuations correspond to real 
variations of the field or are simply due to Poisson noise. 

3.2. Hernquist sphere 

The performance of the algorithm has also been tested by recovering the 



density of a six- dimensional Hernquist sphere [13[. This distribution is often 



used to model the central bulges of galaxies, as well as their dark matter 
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a) Top-hot kernel, Mq=2.0 b) Eponechnikov kernel, Wq=2.0 c) Epanechnikov kernel, Mq=10.0 d) Top-hot + balloon, Mq=2.0 



N=100 

N^IOOO 

N = 10* 

N = 10'' 
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Figure 4: Probability distribution of the variable qi = log j(x^ for the six-dimensional 
Hernquist sphere. On the bottom panels, a three-dimensional Euclidean metric has been 
imposed locally to both the position and velocity subspaces. 

haloes. The density of particles in the phase space of three-dimensional 
positions r and velocities v can be written as 



/(r,v) 



M/a^ 3 sin"^ + - e)(l - 2e){Se^ - 8e - 3) 



47r3 {2GM/af'^ (1 - tf^ 

(15) 

in terms of the dimensionless specific binding energy of the particle 

1 i;2 

(16) 



l + r/a IGMja 

and the total mass M and characteristic radius a of the system. The gener- 
ation of a random realization of this distribution is described in [sj . 

Results obtained for different values of N are displayed in Figure H] and 
Table [21 Overall, they are qualitatively similar to the example discussed in 
the previous section, with only minor differences due to the higher dimen- 
sionality of the problem and the very inhomogeneous nature of the Hernquist 
density distribution. In particular, the bias correction is much more impor- 
tant in six dimensions, reaching values as high as a factor of ~ 6.7 for the 
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AT 
iV 


Top-hat 


Epanechnikov 


T7-i^(^ i\/f in 

H/pa., Mq = iu 


Top-hat ±balloon 


100 


-0.07 ±0.46 


-0.16 ±0.49 


-0.25 ±0.49 


-0.21 ±0.56 


1000 


-0.08 ±0.31 


-0.24 ±0.34 


-0.13 ±0.29 


-0.11 ±0.31 


10^ 


-0.01 ±0.28 


-0.16 ±0.30 


-0.04 ±0.24 


0.01 ±0.22 


10^ 


0.03 ±0.26 


-0.04 ±0.26 


0.03 ±0.20 


0.05 ±0.16 


100 


-0.10 ±0.44 


-0.21 ±0.49 


-0.28 ±0.48 


-0.24 ±0.52 


1000 


-0.12 ±0.29 


-0.26 ±0.33 


-0.15 ±0.28 


-0.10 ±0.27 


10^ 


-0.02 ±0.26 


-0.17 ±0.30 


-0.05 ±0.23 


0.02 ±0.19 


10^ 


0.02 ±0.26 


-0.04 ±0.26 


0.02 ±0.20 


0.06 ±0.14 



Table 2: Average value (qi) and dispersion \/ (qf) — {qt)'^ of the variable qi — log 
for the six-dimensional Hernquist sphere. 



Epanechnikov kernel with Mq = 2. Moreover, many more points are neces- 
sary in order to achieve an adequate samphng, and a clear evolution with is 
now evident in the probability distribution of g^. The negative bias observed 
at low is mostly due to oversmoothing of the central regions, which contain 
the majority of the particles. The Hernquist distribution becomes optimally 
resolved for = 10^ — 10^: the sampling within a smoothing volume be- 
comes close to Poissonian, and the probability distribution of approaches 
the asymptotic for the chosen kernel. As in the two-dimensional ring, the 
specification of a metric based on external knowledge of the problem (in this 
case, hi = h2 = and = = h^) affects the results only mildly. 

4. Conclusions 

Kernel density estimation has been implemented within the Field Estima- 
tor for Arbitrary Spaces (FiEstAS) algorithm, using different kernels and 
opening the possibility of combining sample point and balloon estimators. 
The only free parameters are the specific form of the kernel function (top-hat, 
triangular-shaped cloud and Epanechnikov kernels are provided by default) 
and the smoothing parameter Mq. The bandwidth matrix, constrained to be 
diagonal, is automatically computed for every point. Additional constraints 
can be imposed by the user, but the test cases considered do not suggest 
that this results in a significant advantage. In fact, it has already been es- 



tablished for a wide range of cases [see e.g. Il0| that independent bandwidths 



(arbitrary metric) do not lose power against the Euclidean metric, even if 
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the latter is true. A bias correction must be applied when one is only inter- 
ested in the values of the density field exactly at the sample points Xj. The 
magnitude of this correction depends on the details of the kernel, but it is 
already significant aX D = 2 and tends to increase with dimensionality. 

The optimal choice of kernel and smoothing parameter are, of course, 
problem-dependent. Based on the results presented in the previous section, 
the combination of a top-hat kernel with Mq = 2 with the balloon estima- 
tor given by equation (ITOl) seems to yield a reasonable compromise between 
accuracy (low dispersion) and resolution (small number of points required) 
for any number D of dimensions. This, however, may not hold in the gen- 
eral case, and the user is encouraged to experiment with different options. 
In particular, smaller values of the smoothing parameter Mq are unlikely to 
provide useful results, but larger bandwidths may be helpful in order to re- 
duce the statistical noise of the estimator at the expense of losing information 
about the small-scale structure of the data. The kernel shape has a much 
milder effect, but in some cases (e.g. if exact mass conservation is required), 
a sample point may be preferable to a balloon estimator. In this case, the 
Epanechnikov kernel is optimal for an L2 loss criterion with fixed bandwidths 
[2(], and this would be, in principle, the recommended choice. 
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